Carga de librerias
rm(list = ls())
# preprocessing
library(data.table)
library(dplyr)
library(lubridate)
library(tidyr)
library(lubridate)
library(stats)
library(bit64)
library(purrr)
# visualizaciones
library(ggplot2)
library(plotly)
library(cowplot)
library(kableExtra)
theme_set(theme_grey())
# models
library(h2o)
library(scorecard)
library(InformationValue)
library(ROCit)
#require(devtools)
#install_version("h2o", version = "3.30.0.1", repos = "http://cran.us.r-project.org")
cm_real <- fread('~/Desktop/test/files/costo_marginal_real.csv')
cm_prog <- fread('~/Desktop/test/files/costo_marginal_programado.csv')
# Eliminar datos con id duplicado -----------------------------------------
cm_real <- cm_real %>% group_by(barra_mnemotecnico,fecha,hora) %>% slice(1)
cm_prog <- cm_prog %>% group_by(mnemotecnico_barra,fecha,hora) %>% slice(1)
A continuación se presenta una visualizacion de las bases de datos
COSTO MARGINAL REAL
| barra_mnemotecnico | barra_referencia_mnemotecnico | fecha | hora | costo_en_dolares | costo_en_pesos | nombre |
|---|---|---|---|---|---|---|
| BA99L117SE054L117 | BA02T002SE032T002 | 2019-01-01 | 1 | 57.65 | 40.10653 | BA S/E COLLAHUASI 220KV-BP1 |
| BA99L117SE054L117 | BA02T002SE032T002 | 2019-01-01 | 2 | 57.65 | 40.10653 | BA S/E COLLAHUASI 220KV-BP1 |
| BA99L117SE054L117 | BA02T002SE032T002 | 2019-01-01 | 3 | 57.65 | 40.10653 | BA S/E COLLAHUASI 220KV-BP1 |
| BA99L117SE054L117 | BA02T002SE032T002 | 2019-01-01 | 4 | 57.65 | 40.10653 | BA S/E COLLAHUASI 220KV-BP1 |
| BA99L117SE054L117 | BA02T002SE032T002 | 2019-01-01 | 5 | 57.65 | 40.10653 | BA S/E COLLAHUASI 220KV-BP1 |
| BA99L117SE054L117 | BA02T002SE032T002 | 2019-01-01 | 6 | 57.65 | 40.10653 | BA S/E COLLAHUASI 220KV-BP1 |
COSTO MARGINAL PROGRAMADO
| mnemotecnico_barra | nombre_barra | fecha | hora | costo |
|---|---|---|---|---|
| BA99L117SE054L117 | BA S/E COLLAHUASI 220KV-BP1 | 2019-01-01 | 1 | 130.35093 |
| BA99L117SE054L117 | BA S/E COLLAHUASI 220KV-BP1 | 2019-01-01 | 2 | 131.10417 |
| BA99L117SE054L117 | BA S/E COLLAHUASI 220KV-BP1 | 2019-01-01 | 3 | 54.79902 |
| BA99L117SE054L117 | BA S/E COLLAHUASI 220KV-BP1 | 2019-01-01 | 4 | 56.08049 |
| BA99L117SE054L117 | BA S/E COLLAHUASI 220KV-BP1 | 2019-01-01 | 5 | 54.79958 |
| BA99L117SE054L117 | BA S/E COLLAHUASI 220KV-BP1 | 2019-01-01 | 6 | 56.08049 |
Para ello se hara un left join de las base de datos de costo marginal real y costo marginal programado usando como llave primaria el mnemotecnico de las barras, la fechas y la hora.
costo_marginal <- cm_real %>% left_join(cm_prog,by=c('barra_mnemotecnico'='mnemotecnico_barra','fecha','hora')) %>% data.table()
La base de datos resultante esta dada por:
| barra_mnemotecnico | barra_referencia_mnemotecnico | fecha | hora | costo_en_dolares | costo_en_pesos | nombre | nombre_barra | costo |
|---|---|---|---|---|---|---|---|---|
| BA01G004SE001T011 | BA02T002SE032T002 | 2019-01-01 | 1 | 50.41 | 35.06973 | BA S/E CANDELARIA B1 - 220KV | BA S/E CANDELARIA B1 - 220KV | 50.11629 |
| BA01G004SE001T011 | BA02T002SE032T002 | 2019-01-01 | 2 | 50.41 | 35.06973 | BA S/E CANDELARIA B1 - 220KV | BA S/E CANDELARIA B1 - 220KV | 50.52284 |
| BA01G004SE001T011 | BA02T002SE032T002 | 2019-01-01 | 3 | 49.40 | 34.36709 | BA S/E CANDELARIA B1 - 220KV | BA S/E CANDELARIA B1 - 220KV | 50.11950 |
| BA01G004SE001T011 | BA02T002SE032T002 | 2019-01-01 | 4 | 48.56 | 33.78271 | BA S/E CANDELARIA B1 - 220KV | BA S/E CANDELARIA B1 - 220KV | 49.85409 |
| BA01G004SE001T011 | BA02T002SE032T002 | 2019-01-01 | 5 | 46.51 | 32.35654 | BA S/E CANDELARIA B1 - 220KV | BA S/E CANDELARIA B1 - 220KV | 49.85409 |
| BA01G004SE001T011 | BA02T002SE032T002 | 2019-01-01 | 6 | 46.21 | 32.14783 | BA S/E CANDELARIA B1 - 220KV | BA S/E CANDELARIA B1 - 220KV | 42.10964 |
Histograma Variables
hist <- function(df,X,title,xname){
ggplot(df, aes(x=X))+geom_histogram(color="black", fill="white")+
ggtitle(paste0("Histograma ",title))+xlab(xname)
}
h1 <- hist(costo_marginal,costo_marginal$costo_en_pesos,"CM real en pesos","costo_en_pesos")
h2 <- hist(costo_marginal,costo_marginal$costo,"CM programado en pesos","costo")
plot_grid(h1,h2)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Según el histograma tanto el costro marginal real como el programado poseen una distribución asimétrica positiva.
Summary
costo_marginal %>% select(costo,costo_en_dolares,costo_en_pesos) %>% summary()
## costo costo_en_dolares costo_en_pesos
## Min. :-12 Min. : 0.00 Min. : 0.00
## 1st Qu.: 50 1st Qu.: 50.65 1st Qu.: 34.11
## Median : 55 Median : 56.09 Median : 37.99
## Mean : 62 Mean : 63.71 Mean : 43.00
## 3rd Qu.: 64 3rd Qu.: 65.90 3rd Qu.: 44.72
## Max. :611 Max. :1109.97 Max. :734.36
## NA's :3409022
Por otro lado podemos visualizar que tanto la mediana como la media de los costros programados y reales son similares, sin embargo el costo programado no se encuentra disponible para todas las observaciones del dataset.
En la siguiente tabla podemos observar el número de observaciones para las cuales existe un costo programado y el porcentaje del total al cual corresponden dichas observaciones:
Número de barras para las cual se programa el costo
| Núm obs total | Núm obs con costo programado | % Casos con costo programado |
|---|---|---|
| 4309951 | 900929 | 20.9 |
A continuación se presenta el analisis de la distribución del número de veces que se repite cada tipo de barra en el dataset:
dist_barra <- costo_marginal[,.("N"=.N),by="barra_mnemotecnico"]
setorder(dist_barra,-N)
dist_barra$N %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 720 4345 4345 4225 4345 4345
Se puede visualizar que la minima cantidad de veces que se repite una barra en el dataset son 720 veces la cual corresponde a la barra BA01G460SE002G460. Por otro lado, las barra que más se repiten lo hacen 8690 veces y corresponde a las barras BA01T002SE036T002 y BA02T003SE004T003. Además la cantidad promedio de repeticiones por barra corresponde a 4238, por otro lado tan solo un 25% de las barras tienen menos de 4345 repeticiones.
Tras realizar generar las variables solicitadas se obtuvo el siguiente data frame:
| barra_mnemotecnico | barra_referencia_mnemotecnico | fecha | hora | costo_en_dolares | costo_en_pesos | nombre | nombre_barra | costo | desviacion | desviacion_pct | desviacion_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| BA01G004SE001T011 | BA02T002SE032T002 | 2019-01-01 | 1 | 50.41 | 35.06973 | BA S/E CANDELARIA B1 - 220KV | BA S/E CANDELARIA B1 - 220KV | 50.11629 | -15.046556 | -0.4290468 | 1 |
| BA01G004SE001T011 | BA02T002SE032T002 | 2019-01-01 | 2 | 50.41 | 35.06973 | BA S/E CANDELARIA B1 - 220KV | BA S/E CANDELARIA B1 - 220KV | 50.52284 | -15.453106 | -0.4406394 | 1 |
| BA01G004SE001T011 | BA02T002SE032T002 | 2019-01-01 | 3 | 49.40 | 34.36709 | BA S/E CANDELARIA B1 - 220KV | BA S/E CANDELARIA B1 - 220KV | 50.11950 | -15.752415 | -0.4583575 | 1 |
| BA01G004SE001T011 | BA02T002SE032T002 | 2019-01-01 | 4 | 48.56 | 33.78271 | BA S/E CANDELARIA B1 - 220KV | BA S/E CANDELARIA B1 - 220KV | 49.85409 | -16.071377 | -0.4757279 | 1 |
| BA01G004SE001T011 | BA02T002SE032T002 | 2019-01-01 | 5 | 46.51 | 32.35654 | BA S/E CANDELARIA B1 - 220KV | BA S/E CANDELARIA B1 - 220KV | 49.85409 | -17.497547 | -0.5407731 | 1 |
| BA01G004SE001T011 | BA02T002SE032T002 | 2019-01-01 | 6 | 46.21 | 32.14783 | BA S/E CANDELARIA B1 - 220KV | BA S/E CANDELARIA B1 - 220KV | 42.10964 | -9.961809 | -0.3098750 | 1 |
De forma adicional se contruiran las variables camada y date_ymdh ya que seran utiliadas para los puntos siguientes.
costo_marginal[,c("camada","date_ymdh"):=.(as.Date(fecha,'%Y-%m-%d') %>% format('%Y-%m-01') %>% as.Date(),
as.POSIXct(as.character(fecha),"UTC")+hours(hora))]
Visualización nueva base de datos:
| barra_mnemotecnico | barra_referencia_mnemotecnico | fecha | hora | costo_en_dolares | costo_en_pesos | nombre | nombre_barra | costo | desviacion | desviacion_pct | desviacion_cat | camada | date_ymdh |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BA01G004SE001T011 | BA02T002SE032T002 | 2019-01-01 | 1 | 50.41 | 35.06973 | BA S/E CANDELARIA B1 - 220KV | BA S/E CANDELARIA B1 - 220KV | 50.11629 | -15.046556 | -0.4290468 | 1 | 2019-01-01 | 2019-01-01 01:00:00 |
| BA01G004SE001T011 | BA02T002SE032T002 | 2019-01-01 | 2 | 50.41 | 35.06973 | BA S/E CANDELARIA B1 - 220KV | BA S/E CANDELARIA B1 - 220KV | 50.52284 | -15.453106 | -0.4406394 | 1 | 2019-01-01 | 2019-01-01 02:00:00 |
| BA01G004SE001T011 | BA02T002SE032T002 | 2019-01-01 | 3 | 49.40 | 34.36709 | BA S/E CANDELARIA B1 - 220KV | BA S/E CANDELARIA B1 - 220KV | 50.11950 | -15.752415 | -0.4583575 | 1 | 2019-01-01 | 2019-01-01 03:00:00 |
| BA01G004SE001T011 | BA02T002SE032T002 | 2019-01-01 | 4 | 48.56 | 33.78271 | BA S/E CANDELARIA B1 - 220KV | BA S/E CANDELARIA B1 - 220KV | 49.85409 | -16.071377 | -0.4757279 | 1 | 2019-01-01 | 2019-01-01 04:00:00 |
| BA01G004SE001T011 | BA02T002SE032T002 | 2019-01-01 | 5 | 46.51 | 32.35654 | BA S/E CANDELARIA B1 - 220KV | BA S/E CANDELARIA B1 - 220KV | 49.85409 | -17.497547 | -0.5407731 | 1 | 2019-01-01 | 2019-01-01 05:00:00 |
| BA01G004SE001T011 | BA02T002SE032T002 | 2019-01-01 | 6 | 46.21 | 32.14783 | BA S/E CANDELARIA B1 - 220KV | BA S/E CANDELARIA B1 - 220KV | 42.10964 | -9.961809 | -0.3098750 | 1 | 2019-01-01 | 2019-01-01 06:00:00 |
Gráfica de barras desviación cat por camada
df_plot <- costo_marginal[!is.na(desviacion_cat),.("n"=.N),by=c("camada","desviacion_cat")]
ggplot(df_plot, aes(fill=desviacion_cat, y=n, x=camada)) + geom_bar(position="stack", stat="identity")+ggtitle("Distribución desviacion_cat por camada")
Se puede visualizar que a lo largo de las camadas la moda es una desviación entre el costo real y programado mayor al 15% en aquellas observaciones que poseen un costo programado, las cuales corresponden a un 20.9% del dataset
Para analizar si existe alguna tendencia temporal de la desviación categórica, gráficaremos la serie de tiempo por camada de los la proporción de casos cuya desviación entre el costo real y programado fue inferior al 15% (desc_cat igual a cero).
Serie de tiempo proporción de casos con una desviacion entre el costo real y programado es inferior al 15%
df_plot <- costo_marginal %>% mutate(x=1,desviacion_cat=paste0("desv_",desviacion_cat))
df_plot <- dcast(df_plot, fecha ~ desviacion_cat, value.var = "x" ,fun.aggregate=sum,na.rm=TRUE)
df_plot <- df_plot %>% mutate(p=(desv_0/(desv_1+desv_0)))
ggplot(df_plot , aes(x=fecha, y=p)) + geom_line(col="blue") + ggtitle(" Serie de Tiempo prop desviacion absoluta menor al 15%")
Se puede visualizar una tendencia a la baja a lo largo de los meses, exto indicaría que la porporcion de casos con una desviacion entre el costo real y programado es superior al 15% se encuentra al alza.
A continuación se presenta el script de la función time_plot_costo_barra:
time_plot_costo_barra <- function(codigo_barra,fecha_inicial,fecha_final,df=costo_marginal){
df <- df %>% filter(barra_mnemotecnico==codigo_barra & date_ymdh>=as.Date(fecha_inicial) & date_ymdh<=as.Date(fecha_final))
df <- df %>% select(date_ymdh,costo,costo_en_pesos) %>% gather("cm", "value", -date_ymdh)
ggplot(df, aes(x=date_ymdh, y=value, col=cm)) + geom_line() + ggtitle(paste0("Serie de tiempo costo marginal real y programado ",codigo_barra))
}
Las gráficas solicitadas se presentan a continuación:
codigos <- c("BA01T002SE036T002","BA02T003SE004T003","BA83L131SE134L131","BA01G004SE008G004")
plots <- map(codigos,~time_plot_costo_barra(.x,"2019-01-24","2019-07-01"))
plot_grid(plots[[1]],plots[[2]],plots[[3]],plots[[4]])
Se puede visualizar en que en aquellas barras que poseen un costo programado, en general este es superior al costo real.
Eliminación barra con cmg_real igual a cero durante todos los días
costo_marginal[,sum_cmg_real:=sum(costo_en_pesos),by="barra_mnemotecnico"]
costo_marginal <- costo_marginal[sum_cmg_real!=0,]
Visulización del dataset:
| nemotecnico_se | fecha | hora | gen_eolica_total_mwh | gen_geotermica_total_mwh | gen_hidraulica_total_mwh | gen_solar_total_mwh | gen_termica_total_mwh | cmg_real | cmg_prog | cmg_desv | cmg_desv_pct | n_barras | demanda_mwh | cap_inst_mw |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SE031G216 | 2019-01-04 00:00:00 UTC | 1 | NA | NA | NA | NA | 0 | 56.2 | 55.62785 | 0.57 | 1.02 | 2 | 1210767 | 13.20785 |
| SE031G216 | 2019-01-04 00:00:00 UTC | 2 | NA | NA | NA | NA | 0 | 56.2 | 55.37664 | 0.82 | 1.48 | 2 | 113232 | 13.20785 |
| SE031G216 | 2019-01-04 00:00:00 UTC | 3 | NA | NA | NA | NA | 0 | 56.2 | 59.53189 | -3.33 | -5.59 | 2 | 1089415 | 13.20785 |
| SE031G216 | 2019-01-04 00:00:00 UTC | 4 | NA | NA | NA | NA | 0 | 56.2 | 174.37892 | -118.18 | -67.77 | 2 | 1096867 | 13.20785 |
| SE031G216 | 2019-01-04 00:00:00 UTC | 5 | NA | NA | NA | NA | 0 | 56.2 | 172.82031 | -116.62 | -67.48 | 2 | 1071851 | 13.20785 |
| SE031G216 | 2019-01-04 00:00:00 UTC | 6 | NA | NA | NA | NA | 0 | 56.2 | 172.82031 | -116.62 | -67.48 | 2 | 1038729 | 13.20785 |
La llave primaria de las observaciones del dataset se compone por las variables nemotecnico_se - fecha - hora, por lo demás el resto de las variables corresponden a variables de tipo númericas.
Dimensiones dataset
| num filas | num col |
|---|---|
| 112779 | 15 |
df_pred[,c("anno","mes","semana","dia","dia_semana"):=list(lubridate::year(fecha),lubridate::month(fecha),
lubridate::week(fecha),lubridate::day(fecha),
base::weekdays(fecha %>% as.Date()))]
df_pred[,"weekend":=ifelse(dia_semana %in% c("Saturday","Sunday"),1,0)]
El dataset resultante esta dado por:
| nemotecnico_se | fecha | hora | gen_eolica_total_mwh | gen_geotermica_total_mwh | gen_hidraulica_total_mwh | gen_solar_total_mwh | gen_termica_total_mwh | cmg_real | cmg_prog | cmg_desv | cmg_desv_pct | n_barras | demanda_mwh | cap_inst_mw | anno | mes | semana | dia | dia_semana | weekend |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SE031G216 | 2019-01-04 00:00:00 UTC | 1 | NA | NA | NA | NA | 0 | 56.2 | 55.62785 | 0.57 | 1.02 | 2 | 1210767 | 13.20785 | 2019 | 1 | 1 | 4 | Friday | 0 |
| SE031G216 | 2019-01-04 00:00:00 UTC | 2 | NA | NA | NA | NA | 0 | 56.2 | 55.37664 | 0.82 | 1.48 | 2 | 113232 | 13.20785 | 2019 | 1 | 1 | 4 | Friday | 0 |
| SE031G216 | 2019-01-04 00:00:00 UTC | 3 | NA | NA | NA | NA | 0 | 56.2 | 59.53189 | -3.33 | -5.59 | 2 | 1089415 | 13.20785 | 2019 | 1 | 1 | 4 | Friday | 0 |
| SE031G216 | 2019-01-04 00:00:00 UTC | 4 | NA | NA | NA | NA | 0 | 56.2 | 174.37892 | -118.18 | -67.77 | 2 | 1096867 | 13.20785 | 2019 | 1 | 1 | 4 | Friday | 0 |
| SE031G216 | 2019-01-04 00:00:00 UTC | 5 | NA | NA | NA | NA | 0 | 56.2 | 172.82031 | -116.62 | -67.48 | 2 | 1071851 | 13.20785 | 2019 | 1 | 1 | 4 | Friday | 0 |
| SE031G216 | 2019-01-04 00:00:00 UTC | 6 | NA | NA | NA | NA | 0 | 56.2 | 172.82031 | -116.62 | -67.48 | 2 | 1038729 | 13.20785 | 2019 | 1 | 1 | 4 | Friday | 0 |
plot_daily_series_sub_estation <- function(dates,subestacion,variable,df=df_pred){
plot_sub_estation <- function(date,subestacion,variable,df=df_pred){
df <- df[fecha==as.Date(date) & nemotecnico_se==subestacion,]
df[,date_ymdh:=as.POSIXct(as.character(fecha),"UTC")+hours(hora)]
ggplot(df , aes(x=date_ymdh, y=get(variable))) + geom_line(color="blue") + ggtitle(paste0(" Serie de Tiempo subestacion: ",subestacion,' - Fecha:',date))
}
map(dates,~plot_sub_estation(.x,subestacion,variable))
}
De forma adicional se añadira la función:
plot_ts_sub_estacion <- function(date_min,date_max,subestacion,variable,df=df_pred){
df <- df[fecha<=as.Date(date_max) & fecha>=as.Date(date_min) & nemotecnico_se==subestacion,]
df[,date_ymdh:=as.POSIXct(as.character(fecha),"UTC")+hours(hora)]
ggplot(df , aes(x=date_ymdh, y=get(variable))) + geom_line(color="#69b3a2") + ggtitle(paste0(" Serie de Tiempo ",variable," subestacion: ",subestacion))
}
Sub estación SE005T002
plot_ts_sub_estacion('2019-01-10','2019-01-14','SE005T002','gen_solar_total_mwh')
Se puede visualizar que la generación térmica de la subestación térmica posee una conducta estacional, presentando valores superiores a cero entre las 9 y 21 horas aproximadamente.
De forma analoga se utilizará la función solicitada en el punto anterior.
plots <- plot_daily_series_sub_estation(c(as.Date('2019-01-10')+days(1:4)),'SE005T002','gen_solar_total_mwh')
plot_grid(plots[[1]],plots[[2]],plots[[3]],plots[[4]])
Sub estación SE127T005
plot_ts_sub_estacion('2019-01-10','2019-01-14','SE127T005','gen_solar_total_mwh')
La subestación SE127T005 presenta una estacionalidad menos marcada que la subestación SE005T002, por otro lado en términos de magnitudes su generación térmica es mucho menor.
De forma analoga se utilizará la función solicitada en el punto anterior.
plots <- plot_daily_series_sub_estation(c(as.Date('2019-01-10')+days(1:4)),'SE127T005','gen_solar_total_mwh')
plot_grid(plots[[1]],plots[[2]],plots[[3]],plots[[4]])
Sub estación SE020G213
plot_ts_sub_estacion('2019-05-14','2019-05-17','SE020G213','gen_termica_total_mwh')
A diferencia de la generación solar no existe una estacionalidad aparente en la serie de tiempo.
De forma analoga se utilizará la función solicitada en el punto anterior.
plots <- plot_daily_series_sub_estation(c(as.Date('2019-05-14')+days(1:3)),'SE020G213','gen_termica_total_mwh')
plot_grid(plots[[1]],plots[[2]],plots[[3]])
Sub estación SE106G216
plot_ts_sub_estacion('2019-05-14','2019-05-17','SE106G216','gen_termica_total_mwh')
La serie de tiempo presenta una observación posiblemente atipica durante una hora del día 15 de mayo, si se excluyese dicha observación esta presentaria un comportamiento homocedástico.
De forma analoga se utilizará la función solicitada en el punto anterior.
plots <- plot_daily_series_sub_estation(c(as.Date('2019-05-14')+days(1:3)),'SE106G216','gen_termica_total_mwh')
plot_grid(plots[[1]],plots[[2]],plots[[3]])
Dicha variable sera denominada “flag”
df_pred[,flag:=ifelse(abs(cmg_desv_pct)>15,1,0)]
Luego, debemos considerar que habrian observaciones que podrían generar ruido en nuestro modelo:
Por lo tanto incluiremos una corrección de la variable flag:
df_pred[,flag:=ifelse((is.na(cmg_real) | cmg_real==0 | is.na(cmg_prog) | cmg_prog==0),NA,flag)]
Visualización del dataset:
| nemotecnico_se | fecha | hora | gen_eolica_total_mwh | gen_geotermica_total_mwh | gen_hidraulica_total_mwh | gen_solar_total_mwh | gen_termica_total_mwh | cmg_real | cmg_prog | cmg_desv | cmg_desv_pct | n_barras | demanda_mwh | cap_inst_mw | anno | mes | semana | dia | dia_semana | weekend | flag |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SE031G216 | 2019-01-04 00:00:00 UTC | 1 | NA | NA | NA | NA | 0 | 56.2 | 55.62785 | 0.57 | 1.02 | 2 | 1210767 | 13.20785 | 2019 | 1 | 1 | 4 | Friday | 0 | 0 |
| SE031G216 | 2019-01-04 00:00:00 UTC | 2 | NA | NA | NA | NA | 0 | 56.2 | 55.37664 | 0.82 | 1.48 | 2 | 113232 | 13.20785 | 2019 | 1 | 1 | 4 | Friday | 0 | 0 |
| SE031G216 | 2019-01-04 00:00:00 UTC | 3 | NA | NA | NA | NA | 0 | 56.2 | 59.53189 | -3.33 | -5.59 | 2 | 1089415 | 13.20785 | 2019 | 1 | 1 | 4 | Friday | 0 | 0 |
| SE031G216 | 2019-01-04 00:00:00 UTC | 4 | NA | NA | NA | NA | 0 | 56.2 | 174.37892 | -118.18 | -67.77 | 2 | 1096867 | 13.20785 | 2019 | 1 | 1 | 4 | Friday | 0 | 1 |
| SE031G216 | 2019-01-04 00:00:00 UTC | 5 | NA | NA | NA | NA | 0 | 56.2 | 172.82031 | -116.62 | -67.48 | 2 | 1071851 | 13.20785 | 2019 | 1 | 1 | 4 | Friday | 0 | 1 |
| SE031G216 | 2019-01-04 00:00:00 UTC | 6 | NA | NA | NA | NA | 0 | 56.2 | 172.82031 | -116.62 | -67.48 | 2 | 1038729 | 13.20785 | 2019 | 1 | 1 | 4 | Friday | 0 | 1 |
Antes de proceder con la creación de features, se completaran las fechas en formato y-m-d h:m:s y sub estaciones que podrían faltar, con el objetivo de poder carcular más adelante nuestra variable de respuesta.
df_pred[,c("date_ymdh","original"):=list(as.POSIXct(as.character(fecha),"UTC")+hours(hora),1)]
all_dates <- seq(from=as.POSIXct(as.character(min(df_pred$fecha)),"UTC"),to=as.POSIXct(as.character(max(df_pred$fecha)),"UTC"), by="hour")
dt <- tidyr::crossing(nemotecnico_se=unique(df_pred$nemotecnico_se),date_ymdh = all_dates) %>% data.table()
df_pred <- dt %>% left_join(df_pred,by=c('nemotecnico_se','date_ymdh')) %>% data.table()
rm(dt)
Visualización del dataset:
| nemotecnico_se | date_ymdh | fecha | hora | gen_eolica_total_mwh | gen_geotermica_total_mwh | gen_hidraulica_total_mwh | gen_solar_total_mwh | gen_termica_total_mwh | cmg_real | cmg_prog | cmg_desv | cmg_desv_pct | n_barras | demanda_mwh | cap_inst_mw | anno | mes | semana | dia | dia_semana | weekend | flag | original |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SE001T002 | 2019-01-01 00:00:00 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
| SE001T002 | 2019-01-01 01:00:00 | 2019-01-01 00:00:00 UTC | 1 | NA | NA | NA | 0 | 0 | 54.03 | 124.60441 | -70.57 | -56.64 | 2 | 61148556 | 14.2881 | 2019 | 1 | 1 | 1 | Tuesday | 0 | 1 | 1 |
| SE001T002 | 2019-01-01 02:00:00 | 2019-01-01 00:00:00 UTC | 2 | NA | NA | NA | 0 | 0 | 54.03 | 125.41552 | -71.39 | -56.92 | 2 | 136880002 | 14.2881 | 2019 | 1 | 1 | 1 | Tuesday | 0 | 1 | 1 |
| SE001T002 | 2019-01-01 03:00:00 | 2019-01-01 00:00:00 UTC | 3 | NA | NA | NA | 0 | 0 | 54.03 | 52.38321 | 1.65 | 3.15 | 2 | 127833826 | 14.2881 | 2019 | 1 | 1 | 1 | Tuesday | 0 | 0 | 1 |
| SE001T002 | 2019-01-01 04:00:00 | 2019-01-01 00:00:00 UTC | 4 | NA | NA | NA | 0 | 0 | 54.03 | 53.48202 | 0.55 | 1.03 | 2 | 133924965 | 14.2881 | 2019 | 1 | 1 | 1 | Tuesday | 0 | 0 | 1 |
| SE001T002 | 2019-01-01 05:00:00 | 2019-01-01 00:00:00 UTC | 5 | NA | NA | NA | 0 | 0 | 54.03 | 52.26047 | 1.77 | 3.39 | 2 | 138980453 | 14.2881 | 2019 | 1 | 1 | 1 | Tuesday | 0 | 0 | 1 |
Nueva dimensión del dataset:
| num filas | num col |
|---|---|
| 125336 | 24 |
Generación de variables
Generaremos variables de tipo cuadrátics y cubicas, con el objetivo de caputar variaciones no linales en nuestras covariables.
vars_num <- df_pred %>% select(starts_with("gen"),starts_with("cmg"),demanda_mwh,cap_inst_mw) %>% names()
eleva <- function(x,n){x**n}
df_pred <- df_pred %>% mutate_at(vars(all_of(vars_num)), .funs = list(cuad = ~eleva(.x,2)))
df_pred <- df_pred %>% mutate_at(vars(all_of(vars_num)), .funs = list(cub = ~eleva(.x,3)))
Luego anadiremos estadísticos descriptivos agrupados por hora de las distintas covariables, con el obtevito de que captural el escenario general de la red en nuestros features.
vars_num <- c(vars_num,paste0(vars_num,"_cuad"),paste0(vars_num,"_cub"))
df_pred <- df_pred %>% group_by(date_ymdh) %>% mutate_at(vars(all_of(vars_num)), .funs = list(mean = ~mean(.x,na.rm = TRUE)))
df_pred <- df_pred %>% group_by(date_ymdh) %>% mutate_at(vars(all_of(vars_num)), .funs = list(sd = ~sd(.x,na.rm = TRUE)))
df_pred <- df_pred %>% group_by(date_ymdh) %>% mutate_at(vars(all_of(vars_num)), .funs = list(median = ~median(.x,na.rm = TRUE)))
df_pred <- df_pred %>% group_by(date_ymdh) %>% mutate_at(vars(all_of(vars_num)), .funs = list(sum = ~sum(.x,na.rm = TRUE))) %>% data.table()
Obteniendo el siguiente dataframe:
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 1218806 65.1 6583164 351.6 17371378 927.8
## Vcells 98073086 748.3 247989252 1892.1 247988816 1892.1
| nemotecnico_se | date_ymdh | fecha | hora | gen_eolica_total_mwh | gen_geotermica_total_mwh | gen_hidraulica_total_mwh | gen_solar_total_mwh | gen_termica_total_mwh | cmg_real | cmg_prog | cmg_desv | cmg_desv_pct | n_barras | demanda_mwh | cap_inst_mw | anno | mes | semana | dia | dia_semana | weekend | flag | original | gen_eolica_total_mwh_cuad | gen_geotermica_total_mwh_cuad | gen_hidraulica_total_mwh_cuad | gen_solar_total_mwh_cuad | gen_termica_total_mwh_cuad | cmg_real_cuad | cmg_prog_cuad | cmg_desv_cuad | cmg_desv_pct_cuad | demanda_mwh_cuad | cap_inst_mw_cuad | gen_eolica_total_mwh_cub | gen_geotermica_total_mwh_cub | gen_hidraulica_total_mwh_cub | gen_solar_total_mwh_cub | gen_termica_total_mwh_cub | cmg_real_cub | cmg_prog_cub | cmg_desv_cub | cmg_desv_pct_cub | demanda_mwh_cub | cap_inst_mw_cub | gen_eolica_total_mwh_mean | gen_geotermica_total_mwh_mean | gen_hidraulica_total_mwh_mean | gen_solar_total_mwh_mean | gen_termica_total_mwh_mean | cmg_real_mean | cmg_prog_mean | cmg_desv_mean | cmg_desv_pct_mean | demanda_mwh_mean | cap_inst_mw_mean | gen_eolica_total_mwh_cuad_mean | gen_geotermica_total_mwh_cuad_mean | gen_hidraulica_total_mwh_cuad_mean | gen_solar_total_mwh_cuad_mean | gen_termica_total_mwh_cuad_mean | cmg_real_cuad_mean | cmg_prog_cuad_mean | cmg_desv_cuad_mean | cmg_desv_pct_cuad_mean | demanda_mwh_cuad_mean | cap_inst_mw_cuad_mean | gen_eolica_total_mwh_cub_mean | gen_geotermica_total_mwh_cub_mean | gen_hidraulica_total_mwh_cub_mean | gen_solar_total_mwh_cub_mean | gen_termica_total_mwh_cub_mean | cmg_real_cub_mean | cmg_prog_cub_mean | cmg_desv_cub_mean | cmg_desv_pct_cub_mean | demanda_mwh_cub_mean | cap_inst_mw_cub_mean | gen_eolica_total_mwh_sd | gen_geotermica_total_mwh_sd | gen_hidraulica_total_mwh_sd | gen_solar_total_mwh_sd | gen_termica_total_mwh_sd | cmg_real_sd | cmg_prog_sd | cmg_desv_sd | cmg_desv_pct_sd | demanda_mwh_sd | cap_inst_mw_sd | gen_eolica_total_mwh_cuad_sd | gen_geotermica_total_mwh_cuad_sd | gen_hidraulica_total_mwh_cuad_sd | gen_solar_total_mwh_cuad_sd | gen_termica_total_mwh_cuad_sd | cmg_real_cuad_sd | cmg_prog_cuad_sd | cmg_desv_cuad_sd | cmg_desv_pct_cuad_sd | demanda_mwh_cuad_sd | cap_inst_mw_cuad_sd | gen_eolica_total_mwh_cub_sd | gen_geotermica_total_mwh_cub_sd | gen_hidraulica_total_mwh_cub_sd | gen_solar_total_mwh_cub_sd | gen_termica_total_mwh_cub_sd | cmg_real_cub_sd | cmg_prog_cub_sd | cmg_desv_cub_sd | cmg_desv_pct_cub_sd | demanda_mwh_cub_sd | cap_inst_mw_cub_sd | gen_eolica_total_mwh_median | gen_geotermica_total_mwh_median | gen_hidraulica_total_mwh_median | gen_solar_total_mwh_median | gen_termica_total_mwh_median | cmg_real_median | cmg_prog_median | cmg_desv_median | cmg_desv_pct_median | demanda_mwh_median | cap_inst_mw_median | gen_eolica_total_mwh_cuad_median | gen_geotermica_total_mwh_cuad_median | gen_hidraulica_total_mwh_cuad_median | gen_solar_total_mwh_cuad_median | gen_termica_total_mwh_cuad_median | cmg_real_cuad_median | cmg_prog_cuad_median | cmg_desv_cuad_median | cmg_desv_pct_cuad_median | demanda_mwh_cuad_median | cap_inst_mw_cuad_median | gen_eolica_total_mwh_cub_median | gen_geotermica_total_mwh_cub_median | gen_hidraulica_total_mwh_cub_median | gen_solar_total_mwh_cub_median | gen_termica_total_mwh_cub_median | cmg_real_cub_median | cmg_prog_cub_median | cmg_desv_cub_median | cmg_desv_pct_cub_median | demanda_mwh_cub_median | cap_inst_mw_cub_median | gen_eolica_total_mwh_sum | gen_geotermica_total_mwh_sum | gen_hidraulica_total_mwh_sum | gen_solar_total_mwh_sum | gen_termica_total_mwh_sum | cmg_real_sum | cmg_prog_sum | cmg_desv_sum | cmg_desv_pct_sum | demanda_mwh_sum | cap_inst_mw_sum | gen_eolica_total_mwh_cuad_sum | gen_geotermica_total_mwh_cuad_sum | gen_hidraulica_total_mwh_cuad_sum | gen_solar_total_mwh_cuad_sum | gen_termica_total_mwh_cuad_sum | cmg_real_cuad_sum | cmg_prog_cuad_sum | cmg_desv_cuad_sum | cmg_desv_pct_cuad_sum | demanda_mwh_cuad_sum | cap_inst_mw_cuad_sum | gen_eolica_total_mwh_cub_sum | gen_geotermica_total_mwh_cub_sum | gen_hidraulica_total_mwh_cub_sum | gen_solar_total_mwh_cub_sum | gen_termica_total_mwh_cub_sum | cmg_real_cub_sum | cmg_prog_cub_sum | cmg_desv_cub_sum | cmg_desv_pct_cub_sum | demanda_mwh_cub_sum | cap_inst_mw_cub_sum |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SE001T002 | 2019-01-01 00:00:00 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NA | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NA | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NA | NaN | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 0 | 0.0000 | 0 | 0.0000 | 0.000 | 0.000 | 0.00000 | 0.000000 | 0 | 0.000 | 0 | 0 | 0.000 | 0 | 0.00 | 0.00 | 0.00 | 0.00000 | 0.0000 | 0 | 0 | 0 | 0 | 0.0 | 0 | 0 | 0 | 0 | 0.000000e+00 | 0.0000 | 0 | 0 |
| SE001T002 | 2019-01-01 01:00:00 | 2019-01-01 00:00:00 UTC | 1 | NA | NA | NA | 0 | 0 | 54.03 | 124.60441 | -70.57 | -56.64 | 2 | 61148556 | 14.2881 | 2019 | 1 | 1 | 1 | Tuesday | 0 | 1 | 1 | NA | NA | NA | 0 | 0 | 2919.241 | 15526.259 | 4980.1249 | 3208.0896 | 3739145900885136 | 204.1498 | NA | NA | NA | 0 | 0 | 157726.6 | 1934640.3 | -3.514474e+05 | -1.817062e+05 | 9223372036854775807 | 2916.913 | NaN | NaN | 23.59497 | 0 | 16.99429 | 51.64506 | 79.87444 | -28.2292308 | -22.1820513 | 6096828 | 119.1431 | NaN | NaN | 1171.4827 | 0 | 1968.727 | 2690.662 | 7893.346 | 2050.242486 | 1238.116184 | 229632446294675 | 39839.92 | NaN | NaN | 65571.93 | 0 | 305455.9 | 141247.7 | 899756.4 | -1.496800e+05 | -69829.32678 | 5648577708859038606 | 18080105 | NA | NA | 26.50628 | 0 | 42.53405 | 4.938389 | 39.673122 | 36.103838 | 27.855263 | 14147776 | 163.3114 | NA | NA | 1693.834 | 0 | 6988.854 | 483.4158 | 7063.5917 | 2646.300840 | 1585.881694 | 7.452336e+14 | 88283.97 | NA | NA | 109939.71 | 0 | 1135763.0 | 36143.32 | 1009037.99 | 1.936813e+05 | 90022.65903 | 4.335476e+18 | 48571127 | NA | NA | 14.21154 | 0 | 0.9000 | 51.29 | 50.81710 | -0.535 | -1.190 | 2306498 | 50.9994 | NA | NA | 262.6279 | 0 | 1.620000 | 2630.668 | 2582.377 | 0.8954125 | 4.049862 | 16513795856089 | 2601.662 | NA | NA | 5456.495 | 0 | 2.916000 | 134927.3 | 131228.9 | -0.2205805 | -2.650487 | 9223372036854775807 | 132757 | 0 | 0 | 188.7598 | 0 | 237.9200 | 1342.772 | 2076.735 | -733.96000 | -576.733333 | 158517551 | 3097.721 | 0 | 0 | 9371.862 | 0 | 27562.18 | 69957.22 | 205227.00 | 53306.30464 | 32191.0208 | 5970443603661567 | 1035838 | 0 | 0 | 524575.5 | 0 | 4276382 | 3672440 | 23393666 | -3.891679e+06 | -1815562.4962 | 6438656496650065239 | 470082742 |
| SE001T002 | 2019-01-01 02:00:00 | 2019-01-01 00:00:00 UTC | 2 | NA | NA | NA | 0 | 0 | 54.03 | 125.41552 | -71.39 | -56.92 | 2 | 136880002 | 14.2881 | 2019 | 1 | 1 | 1 | Tuesday | 0 | 1 | 1 | NA | NA | NA | 0 | 0 | 2919.241 | 15729.052 | 5096.5321 | 3239.8864 | 18736134947520004 | 204.1498 | NA | NA | NA | 0 | 0 | 157726.6 | 1972667.1 | -3.638414e+05 | -1.844143e+05 | 9223372036854775807 | 2916.913 | NaN | NaN | 23.33107 | 0 | 17.04298 | 51.64506 | 80.01715 | -28.3722436 | -22.4074359 | 12491982 | 119.1431 | NaN | NaN | 1133.9592 | 0 | 2011.270 | 2690.662 | 7912.902 | 2056.881087 | 1241.012753 | 907380952498706 | 39839.92 | NaN | NaN | 61525.12 | 0 | 316698.8 | 141247.7 | 902244.1 | -1.503705e+05 | -70009.54251 | 6160912158786026328 | 18080105 | NA | NA | 25.95866 | 0 | 43.04854 | 4.938389 | 39.630330 | 36.082860 | 27.721406 | 27953257 | 163.3114 | NA | NA | 1591.546 | 0 | 7163.939 | 483.4158 | 7065.0958 | 2653.820372 | 1587.879815 | 3.686825e+15 | 88283.97 | NA | NA | 97836.19 | 0 | 1178408.6 | 36143.32 | 1009763.40 | 1.944602e+05 | 90274.53191 | 4.267422e+18 | 48571127 | NA | NA | 14.15606 | 0 | 0.8625 | 51.29 | 51.07936 | -0.755 | -1.685 | 5244464 | 50.9994 | NA | NA | 261.9214 | 0 | 1.487812 | 2630.668 | 2609.116 | 1.1474500 | 5.176250 | 27504402647296 | 2601.662 | NA | NA | 5449.746 | 0 | 2.566477 | 134927.3 | 133273.5 | -0.8394845 | -9.991881 | 9223372036854775807 | 132757 | 0 | 0 | 186.6485 | 0 | 238.6017 | 1342.772 | 2080.446 | -737.67833 | -582.593333 | 324791548 | 3097.721 | 0 | 0 | 9071.674 | 0 | 28157.78 | 69957.22 | 205735.44 | 53478.90827 | 32266.3316 | 23591904764966358 | 1035838 | 0 | 0 | 492200.9 | 0 | 4433783 | 3672440 | 23458348 | -3.909632e+06 | -1820248.1054 | -5836980534949280024 | 470082742 |
| SE001T002 | 2019-01-01 03:00:00 | 2019-01-01 00:00:00 UTC | 3 | NA | NA | NA | 0 | 0 | 54.03 | 52.38321 | 1.65 | 3.15 | 2 | 127833826 | 14.2881 | 2019 | 1 | 1 | 1 | Tuesday | 0 | 0 | 1 | NA | NA | NA | 0 | 0 | 2919.241 | 2744.001 | 2.7225 | 9.9225 | 16341487069798276 | 204.1498 | NA | NA | NA | 0 | 0 | 157726.6 | 143739.6 | 4.492125e+00 | 3.125587e+01 | 9223372036854775807 | 2916.913 | NaN | NaN | 20.86652 | 0 | 17.10417 | 51.04179 | 50.91531 | 0.1263462 | 0.1026282 | 11548567 | 119.1431 | NaN | NaN | 893.9378 | 0 | 2005.167 | 2631.750 | 2608.731 | 2.999881 | 11.009469 | 788617845578312 | 39839.92 | NaN | NaN | 43612.47 | 0 | 314021.3 | 136906.2 | 134387.0 | 3.412460e+00 | 18.12785 | 5805177761661137300 | 18080105 | NA | NA | 22.89169 | 0 | 42.94594 | 5.248282 | 4.125210 | 1.761611 | 3.382143 | 26104758 | 163.3114 | NA | NA | 1305.943 | 0 | 7118.939 | 512.3275 | 395.0199 | 3.585475 | 12.302643 | 3.213042e+15 | 88283.97 | NA | NA | 78080.08 | 0 | 1167604.6 | 38217.91 | 28824.10 | 1.444148e+01 | 95.17946 | 4.419923e+18 | 48571127 | NA | NA | 14.11977 | 0 | 0.8375 | 50.26 | 51.09287 | -0.445 | -0.880 | 4353989 | 50.9994 | NA | NA | 261.4659 | 0 | 1.402813 | 2526.071 | 2610.566 | 1.5892000 | 6.510600 | 18957220212121 | 2601.662 | NA | NA | 5445.458 | 0 | 2.349711 | 126960.7 | 133389.9 | -0.0897565 | -0.690976 | 9223372036854775807 | 132757 | 0 | 0 | 166.9322 | 0 | 239.4583 | 1327.087 | 1323.798 | 3.28500 | 2.668333 | 300262745 | 3097.721 | 0 | 0 | 7151.502 | 0 | 28072.33 | 68425.50 | 67827.01 | 77.99691 | 286.2462 | 20504063985036127 | 1035838 | 0 | 0 | 348899.7 | 0 | 4396298 | 3559561 | 3494062 | 8.872396e+01 | 471.3240 | 3360669213513156881 | 470082742 |
| SE001T002 | 2019-01-01 04:00:00 | 2019-01-01 00:00:00 UTC | 4 | NA | NA | NA | 0 | 0 | 54.03 | 53.48202 | 0.55 | 1.03 | 2 | 133924965 | 14.2881 | 2019 | 1 | 1 | 1 | Tuesday | 0 | 0 | 1 | NA | NA | NA | 0 | 0 | 2919.241 | 2860.327 | 0.3025 | 1.0609 | 17935896250251224 | 204.1498 | NA | NA | NA | 0 | 0 | 157726.6 | 152976.1 | 1.663750e-01 | 1.092727e+00 | 9223372036854775807 | 2916.913 | NaN | NaN | 21.14737 | 0 | 17.33774 | 50.54090 | 51.11693 | -0.5759615 | -1.2951923 | 11474866 | 119.1431 | NaN | NaN | 955.1761 | 0 | 2057.153 | 2583.759 | 2632.404 | 2.208277 | 9.073295 | 835013994369032 | 39839.92 | NaN | NaN | 50867.80 | 0 | 325783.6 | 133436.2 | 136440.9 | -1.661498e+00 | -20.74598 | 6066148384622776188 | 18080105 | NA | NA | 24.09422 | 0 | 43.49340 | 5.527340 | 4.499122 | 1.396999 | 2.773374 | 27045796 | 163.3114 | NA | NA | 1537.300 | 0 | 7293.590 | 538.1648 | 435.8653 | 2.070781 | 9.023685 | 3.519872e+15 | 88283.97 | NA | NA | 103104.76 | 0 | 1210951.6 | 40054.27 | 32172.13 | 6.981528e+00 | 59.16884 | 4.425922e+18 | 48571127 | NA | NA | 14.07329 | 0 | 0.7875 | 49.41 | 50.64713 | -1.055 | -2.105 | 3770526 | 50.9994 | NA | NA | 260.8902 | 0 | 1.240312 | 2441.352 | 2565.147 | 1.6904000 | 6.266689 | 14216866316676 | 2601.662 | NA | NA | 5440.110 | 0 | 1.953492 | 120627.5 | 129919.0 | -1.1743205 | -9.331255 | 9223372036854775807 | 132757 | 0 | 0 | 169.1790 | 0 | 242.7283 | 1314.063 | 1329.040 | -14.97500 | -33.675000 | 298346540 | 3097.721 | 0 | 0 | 7641.409 | 0 | 28800.14 | 67177.73 | 68442.50 | 57.41521 | 235.9057 | 21710363853594853 | 1035838 | 0 | 0 | 406942.4 | 0 | 4560971 | 3469340 | 3547463 | -4.319896e+01 | -539.3956 | -8300838663193783650 | 470082742 |
| SE001T002 | 2019-01-01 05:00:00 | 2019-01-01 00:00:00 UTC | 5 | NA | NA | NA | 0 | 0 | 54.03 | 52.26047 | 1.77 | 3.39 | 2 | 138980453 | 14.2881 | 2019 | 1 | 1 | 1 | Tuesday | 0 | 0 | 1 | NA | NA | NA | 0 | 0 | 2919.241 | 2731.157 | 3.1329 | 11.4921 | 19315566316085208 | 204.1498 | NA | NA | NA | 0 | 0 | 157726.6 | 142731.5 | 5.545233e+00 | 3.895822e+01 | 9223372036854775807 | 2916.913 | NaN | NaN | 22.00156 | 0 | 14.91167 | 49.31417 | 50.61987 | -1.3060256 | -2.8704487 | 10999892 | 119.1431 | NaN | NaN | 1018.2409 | 0 | 1325.509 | 2469.663 | 2578.739 | 9.665145 | 39.528089 | 871776133889391 | 39839.92 | NaN | NaN | 54019.17 | 0 | 162543.6 | 125427.0 | 132081.5 | -2.129882e+01 | -204.19967 | 4937230750272113881 | 18080105 | NA | NA | 24.70794 | 0 | 34.46750 | 6.267906 | 4.125915 | 2.877120 | 5.704398 | 27942971 | 163.3114 | NA | NA | 1532.394 | 0 | 4561.311 | 605.5978 | 390.7418 | 5.388087 | 23.925790 | 3.785983e+15 | 88283.97 | NA | NA | 99312.22 | 0 | 600275.5 | 44752.04 | 28202.90 | 3.649233e+01 | 305.38540 | 4.538961e+18 | 48571127 | NA | NA | 13.99293 | 0 | 0.9500 | 47.32 | 50.64713 | -3.040 | -6.255 | 2987287 | 50.9994 | NA | NA | 259.9153 | 0 | 1.805000 | 2239.185 | 2565.147 | 9.8284500 | 40.716500 | 8923883620369 | 2601.662 | NA | NA | 5431.238 | 0 | 3.429500 | 105958.5 | 129919.0 | -28.1172640 | -244.731254 | 9223372036854775807 | 132757 | 0 | 0 | 176.0125 | 0 | 208.7633 | 1282.168 | 1316.117 | -33.95667 | -74.631667 | 285997214 | 3097.721 | 0 | 0 | 8145.927 | 0 | 18557.12 | 64211.23 | 67047.22 | 251.29378 | 1027.7303 | 22666179481124191 | 1035838 | 0 | 0 | 432153.4 | 0 | 2275611 | 3261103 | 3434118 | -5.537694e+02 | -5309.1914 | -759209008891900407 | 470082742 |
Nueva dimensión del dataset:
| num filas | num col |
|---|---|
| 125336 | 178 |
Defición de la variable de respuesta
\[ Y_k = \left \{ \begin{matrix} 1 & \mbox{Desviacion en el tiempo t+k} \\ 0 & \mbox{Sin desviacion en el tiempo t+k}\end{matrix}\right. \]
Cabe destacar que nuestros features deben estar posicionados en un tiempo \(t\). Para el caso del modelo 1 se debe fijar \(k=1\) y para el modelo dos \(k=12\).
Creación de la variable de respuesta
Con el objetivo de optimizar el codigo se generaran las variables de respuesta para el modelo 1 y 2.
df_pred <- df_pred[order(nemotecnico_se,-date_ymdh)]
df_pred[,c("hora","mes","dia","fecha","anno","semana"):=NULL]
df_pred <- df_pred %>% setnames(old='date_ymdh',new='date_x')
df_pred[,c("Y","Y2"):=list(lag(flag,1),lag(flag,12)),by="nemotecnico_se"]
df_pred[,c("flag","weekend","date_y"):=list(as.character(flag),as.character(weekend),date_x+hours(1))]
df_pred[,camada:=as.Date(date_y) %>% format('%Y-%m-01') %>% as.Date()]
Además eliminaremos las observaciones que tengan valores nulos en su variable de respuesta u valores extraños en las covariables cmg_prog y cmg_real (las cuales se observan en el tiempo \(t\)).
df_pred_M1 <- df_pred[!is.na(Y) & !is.na(cmg_prog) & !is.na(cmg_real) & cmg_prog!=0 & cmg_real!=0 & original==1,] #df modelo 1
df_pred_M2 <- df_pred[!is.na(Y2) & !is.na(cmg_prog) & !is.na(cmg_real) & cmg_prog!=0 & cmg_real!=0 & original==1,] #df modelo 2
Tasa de observaciones con desviaciones
df_pred_M1[,.(p=mean(Y),n1=sum(Y),N=length(Y))] %>% kable() %>% kable_styling(bootstrap_options = "striped",position="center",full_width = F)
| p | n1 | N |
|---|---|---|
| 0.2441617 | 27016 | 110648 |
Entrenamiento del Modelo
Definiremos algunas funciones que se utilizaran más adelante:
df_test_train <- function(df,Y_name,param){
res <- NULL
res$df <- df
res$df[,target:=get(Y_name)]
res$df <- res$df[original==1,]
res$df[,c("date_x","date_y"):=NULL]
# test
obs_test <- sample(seq_len(nrow(res$df)), size = floor(param$p_test*nrow(res$df)))
res$df_test <- res$df[obs_test]
res$df_test[,target:=NULL]
# train
res$df_train <- res$df[-1*obs_test]
n1 <- sum(res$df_train$target);n0 <- ((n1*param$downsampling)/(1-param$downsampling)) %>% round()
res$df_train <- res$df_train[c(which(res$df_train$target==0) %>% sample(n0,replace=TRUE),which(res$df_train$target==1))]
res$df_train[,target:=NULL]
res$df[,target:=NULL]
res
}
perf_metrics <- function(Y,pred){
res <- NULL
res$M_conf <- ModelMetrics::confusionMatrix(Y,pred)
res$auc <- ModelMetrics::auc(Y,pred)
res$tpr <- ModelMetrics::tpr(Y,pred)
res$ks <- ks_stat(Y,pred)
res
}
tasa_clasif_correcta <- function(Y,pred,treshold){
M <- matrix(ncol=2,nrow=1)
M[1,1] <- ModelMetrics::tpr(Y,pred,treshold)
M[1,2] <- ModelMetrics::tnr(Y,pred,treshold)
colnames(M) <- c("tpr","tnr")
M
}
Def Parametros
param <- NULL
param$p_test <- 0.2
param$downsampling <- 0.5
param$importancia <- 0.95
param$var_names <- (df_pred %>% select(-c(Y,Y2,camada,nemotecnico_se,date_y,date_x))) %>% names()
Obtención data train y test
df_modeling_Y1 <- df_test_train(df_pred_M1,"Y",param)
Sanity Check Media Y
| Train | Test |
|---|---|
| 0.5 | 0.2425324 |
Entrenamiento del Modelo
h2o.init(min_mem_size = '1G', max_mem_size = '6G')
Pasamos las datas de train y test a h2o:
test_M1 <- as.h2o(df_modeling_Y1$df_test, destination_frame = 'test')
train_M1 <- as.h2o(df_modeling_Y1$df_train, destination_frame = 'train')
Planteamos el modelo regularizado y seleccionamos las variables más importantes hasta sumar ciero umbral de importancia acumulada, obteniendo las siguientes variables a incluir en nuestro modelo:
model_reg_M1 <- h2o.glm(y = 'Y', x = param$var_names, training_frame = train_M1, seed = 1234,family = 'binomial',
alpha = 1, lambda_search = TRUE, nlambdas = 2000, max_active_predictors = 20 ,model_id = "model_reg_M1",
remove_collinear_columns = TRUE)
vars_M1 <- h2o::h2o.varimp(model_reg_M1) %>% mutate(perCumSum =cumsum(percentage)) %>% filter(perCumSum<param$importancia)
vars_M1 <- vars_M1$variable %>% as.character()
vars_M1
## [1] "cmg_desv_cuad" "cmg_real_median"
## [3] "cmg_desv_cuad_median" "cmg_prog"
## [5] "gen_solar_total_mwh_median" "cmg_real_sum"
## [7] "cmg_desv_pct_median" "cap_inst_mw_median"
## [9] "gen_termica_total_mwh_cub_sd" "gen_termica_total_mwh_cub_mean"
## [11] "n_barras"
model_M1 <- h2o.glm(y = 'Y',x = vars_M1,training_frame = train_M1,validation_frame = test_M1,seed = 1234,
family = 'binomial',lambda=0,compute_p_values = T,model_id = "model_M1",
remove_collinear_columns = TRUE)
Evaluación de la performance
Para evaluar nuestro modelo utilizaremos las siguientes metricas AUC, KS y también pondemos atención a la matriz de confución.
pred_M1 <- (h2o.predict(object = model_M1, newdata = test_M1)$p1) %>% as.vector() %>% as.numeric()
Y_M1 <- df_modeling_Y1$df_test$Y
per_M1 <- perf_metrics(Y_M1,pred_M1)
Obtenemos las siguientes métricas en test:
| auc | ks | tpr |
|---|---|---|
| 0.7950087 | 0.4597 | 0.5802124 |
Podemos concluir que:
Gráfica AUC
rocit(score=pred_M1,class=Y_M1) %>% plot()
Gráfica KS
rocit(score=pred_M1,class=Y_M1) %>% ksplot()
Ahora veremos como se comporta usando otro threshold:
Threshold 0.5
ModelMetrics::confusionMatrix(Y_M1,pred_M1,0.5)
## [,1] [,2]
## [1,] 14445 2253
## [2,] 2317 3114
Tasas de Clasificación correctas:
| tpr | tnr |
|---|---|
| 0.5802124 | 0.8617707 |
Threshold 0.43
ModelMetrics::confusionMatrix(Y_M1,pred_M1,0.43)
## [,1] [,2]
## [1,] 12480 1527
## [2,] 4282 3840
| tpr | tnr |
|---|---|
| 0.7154835 | 0.7445412 |
Se puede visualizar que el modelo tendría una capacidad predictiva adecuada utilizando un threshold de 0.43
Para plantear el modelo dos se utilizará un random forest con todas las variables y luego se elegiran las mas imporantes las cuales seran las covariables de nuestro modelo real, a continuación se presentan dichas variables
# Definir parametros ------------------------------------------------------
param$p_test <- 0.2
param$downsampling <- 0.5
param$importancia <- 0.1
# Downsampling ------------------------------------------------------------
df_modeling_M2 <- df_test_train(df_pred_M2,"Y2",param)
df_modeling_M2$df_train <- df_modeling_M2$df_train[,Y2:=factor(Y2,levels=c("0","1"))]
df_modeling_M2$df_test <- df_modeling_M2$df_test[,Y2:=factor(Y2,levels=c("0","1"))]
df_modeling_M2$df_train[,fecha:=NULL]
df_modeling_M2$df_test[,fecha:=NULL]
# Datas H2O ---------------------------------------------------------------
test_M2 <- as.h2o(df_modeling_M2$df_test, destination_frame = 'test')
train_M2 <- as.h2o(df_modeling_M2$df_train, destination_frame = 'train')
# Random forest -----------------------------------------------------------
# Modelo con todas las variables
model_M2 <- h2o.randomForest(y = 'Y2',x = c(param$var_names),training_frame = train_M2,validation_frame = test_M2,
seed = 123,stopping_metric ="mean_per_class_error",nfolds = 0)
# Seleccion de variables más importantes
vars_M2 <- h2o::h2o.varimp(model_M2) %>% as.data.frame() %>% mutate(perCumSum=cumsum(percentage %>% as.numeric())) %>% filter(perCumSum<param$importancia)
vars_M2 <- vars_M2$variable %>% as.character()
vars_M2
## [1] "cmg_real" "cmg_real_cuad"
## [3] "cmg_real_sum" "gen_hidraulica_total_mwh_median"
## [5] "cmg_prog"
Modelo Real
model_M2 <- h2o.randomForest(y = 'Y2', x = c(vars_M2), training_frame = train_M2, validation_frame = test_M2, seed = 123,
stopping_metric ="mean_per_class_error",nfolds = 0,model_id = "model_M2")
Métricas de Perfomance en Test
pred_M2 <- (h2o.predict(object = model_M2, newdata = test_M2)$p1) %>% as.vector() %>% as.numeric()
##
|
| | 0%
|
|======================================================================| 100%
Y_M2 <- df_modeling_M2$df_test$Y2
per_M2 <- perf_metrics(Y_M2,pred_M2)
| auc | ks | tpr |
|---|---|---|
| 0.9142298 | 0.674 | 0.1790306 |
Matriz de Confusión
## [,1] [,2]
## [1,] 0 13601
## [2,] 0 2966
Si bien el auc es alto, este modelo tiene nivel de clasificación malo por lo cual no se recomienda su uso.
La base de datos de clima posee el promedio diario de varias series de tiempo, como el problema de clasificación tiene por objetivo predecir lo que ocurrira cada 12 horas, no se podrían utilizar las covariables haciendo un cruze directo por fecha pero si podrían utilizarse utilizando las covariables de lo que ocurrio el día anterior al de nuestra prediccion, por lo cual se le restara un día a la variable fecha antes de hacer el cruze.
# Leer data ---------------------------------------------------------------
df_clima <- fread('~/Desktop/test/files/datos_clima.csv') %>% data.table()
df_clima[,fecha:= fecha %>% as.POSIXct("UTC") %>% floor_date("day")-days(1)]
# Left join y calculo de nuevas covariables -------------------------------
vars_clima <- df_clima %>% select(-c(fecha,subestacion)) %>% names()
df_pred_M2[,fecha:=date_x %>% as.POSIXct("UTC") %>% floor_date("day")]
df_pred_M3 <- df_pred_M2 %>% left_join(df_clima,by=c('nemotecnico_se'='subestacion','fecha')) %>% data.table()
df_pred_M3 <- df_pred_M3 %>% mutate_at(vars(vars_clima), .funs = list(cuad = ~eleva(.x,2)))
## Note: Using an external vector in selections is ambiguous.
## ℹ Use `all_of(vars_clima)` instead of `vars_clima` to silence this message.
## ℹ See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
df_pred_M3 <- df_pred_M3 %>% mutate_at(vars(vars_clima), .funs = list(cub = ~eleva(.x,3)))
vars_clima <- c(vars_clima,paste0(vars_clima,'_cuad'),paste0(vars_clima,'_cub'))
df_pred_M3 <- df_pred_M3 %>% group_by(fecha) %>% mutate_at(vars(vars_clima), .funs = list(mean = ~mean(.x,na.rm = TRUE)))
df_pred_M3 <- df_pred_M3 %>% group_by(fecha) %>% mutate_at(vars(vars_clima), .funs = list(sd = ~sd(.x,na.rm = TRUE)))
df_pred_M3 <- df_pred_M3 %>% group_by(fecha) %>% mutate_at(vars(vars_clima), .funs = list(median = ~median(.x,na.rm = TRUE)))
df_pred_M3 <- df_pred_M3 %>% group_by(fecha) %>% mutate_at(vars(vars_clima), .funs = list(sum = ~sum(.x,na.rm = TRUE))) %>% data.table()
df_pred_M3[,fecha:=NULL]
En un comienzo utilizaremos un random forest cojn todas las covariables y seleccionaremos las más importantes para plantear un modelo final.
# Modelo ------------------------------------------------------------------
# definir parametros
param$p_test <- 0.2
param$downsampling <- 0.5
param$importancia <- 0.15
# Donwsampling ------------------------------------------------------------
df_modeling_M3 <- df_test_train(df_pred_M3,"Y2",param)
df_modeling_M3$df_train <- df_modeling_M3$df_train[,Y2:=factor(Y2,levels=c("0","1"))]
df_modeling_M3$df_train <- df_modeling_M3$df_train[,Y2:=factor(Y2,levels=c("0","1"))]
# Datas h2o ---------------------------------------------------------------
test_M3 <- as.h2o(df_modeling_M3$df_test, destination_frame = 'test')
train_M3 <- as.h2o(df_modeling_M3$df_train, destination_frame = 'train')
# Random forest -----------------------------------------------------------
model_M3 <- h2o.randomForest(y = 'Y2', x = c(param$var_names,vars_clima), training_frame = train_M3, validation_frame = test_M3, seed = 123,
stopping_metric ="mean_per_class_error",nfolds = 0)
vars_M3 <- h2o::h2o.varimp(model_M3) %>% as.data.frame() %>% mutate(perCumSum=cumsum(percentage %>% as.numeric())) %>% filter(perCumSum<param$importancia)
vars_M3 <- vars_M3$variable %>% as.character()
vars_M3
## [1] "lat"
## [2] "lat_cuad"
## [3] "gen_hidraulica_total_mwh_median"
## [4] "lat_cub"
## [5] "gen_hidraulica_total_mwh_cuad_median"
## [6] "cmg_real_sum"
## [7] "cmg_prog_mean"
## [8] "demanda_mwh_cub_sum"
## [9] "demanda_mwh_cub_sd"
## [10] "cmg_real_mean"
## [11] "gen_termica_total_mwh_mean"
## [12] "gen_termica_total_mwh_sum"
Modelo
# Leer data ---------------------------------------------------------------
df_clima <- fread('~/Desktop/test/files/datos_clima.csv') %>% data.table()
df_clima[,fecha:= fecha %>% as.POSIXct("UTC") %>% floor_date("day")-days(1)]
# Left join y calculo de nuevas covariables -------------------------------
vars_clima <- df_clima %>% select(-c(fecha,subestacion)) %>% names()
df_pred_M2[,fecha:=date_x %>% as.POSIXct("UTC") %>% floor_date("day")]
df_pred_M3 <- df_pred_M2 %>% left_join(df_clima,by=c('nemotecnico_se'='subestacion','fecha')) %>% data.table()
df_pred_M3 <- df_pred_M3 %>% mutate_at(vars(vars_clima), .funs = list(cuad = ~eleva(.x,2)))
df_pred_M3 <- df_pred_M3 %>% mutate_at(vars(vars_clima), .funs = list(cub = ~eleva(.x,3)))
vars_clima <- c(vars_clima,paste0(vars_clima,'_cuad'),paste0(vars_clima,'_cub'))
df_pred_M3 <- df_pred_M3 %>% group_by(fecha) %>% mutate_at(vars(vars_clima), .funs = list(mean = ~mean(.x,na.rm = TRUE)))
df_pred_M3 <- df_pred_M3 %>% group_by(fecha) %>% mutate_at(vars(vars_clima), .funs = list(sd = ~sd(.x,na.rm = TRUE)))
df_pred_M3 <- df_pred_M3 %>% group_by(fecha) %>% mutate_at(vars(vars_clima), .funs = list(median = ~median(.x,na.rm = TRUE)))
df_pred_M3 <- df_pred_M3 %>% group_by(fecha) %>% mutate_at(vars(vars_clima), .funs = list(sum = ~sum(.x,na.rm = TRUE))) %>% data.table()
df_pred_M3[,fecha:=NULL]
En un comienzo utilizaremos un random forest cojn todas las covariables y seleccionaremos las más importantes para plantear un modelo final.
Modelo
model_M3 <- h2o.randomForest(y = 'Y2', x = c(vars_M3), training_frame = train_M3, validation_frame = test_M3, seed = 123,
stopping_metric ="mean_per_class_error",nfolds = 0,model_id = "model_M3")
pred_M3 <- (h2o.predict(object = model_M3, newdata = test_M3)$p1) %>% as.vector() %>% as.numeric()
##
|
| | 0%
|
|======================================================================| 100%
Y_M3 <- df_modeling_M3$df_test$Y2
per_M3 <- perf_metrics(Y_M3,pred_M3)
| auc | ks | tpr |
|---|---|---|
| 0.9783445 | 0.8558 | 0.9460573 |
Matriz de confución
## [,1] [,2]
## [1,] 15083 288
## [2,] 1454 5051
Podemos concluir que:
Gráfica AUC
rocit(score=pred_M3,class=Y_M3) %>% plot()
Gráfica KS
rocit(score=pred_M3,class=Y_M3) %>% ksplot()
Al saber si existira una diferencia entre el costo programado y el real se podría generar un accionable para monitoriar los insights más importantes que podrían producir esta diferencia y mantenerlos bajo control.